Morphological Analysis and Diacritical Arabic Text Compression
Abstract
Morphological analysis of Arabic words reduces the storage requirements of Arabic dictionaries and enables more efficient encoding of diacritical Arabic text, faster spell checking, and more efficient optical character recognition. Together, these factors allow efficient storage and archival of multilingual digital libraries that include Arabic texts. This paper presents a lossless compression algorithm based on affix analysis that takes advantage of statistical studies of diacritical Arabic morphological features. The algorithm decomposes a given Arabic word into its root and its affixes. The affixes (prefix, infix, and suffix) are the redundant elements of the word. The roots are stored in a root dictionary. We also maintain categorized affix dictionaries and their valid combinations, and use a list of patterns to validate and generate the morphological forms during encoding and decoding. Since our goal is lossless, reproducible Arabic text, stemming is not an option and noise words (high-frequency words) cannot be filtered out. The resulting root dictionary contains about 8000 three-character roots and 700 four-character roots. We also code the most frequently occurring diacritical bigrams (biliterals) and trigrams (triliterals) with unused codewords in the ASCII, ASMO-449, and Unicode standard codes. Using the root dictionaries combined with the proposed coding scheme, compression ratios for proper Arabic text compare favorably with those of other unigram non-diacritical methods.
Keywords: compression, affixes, morphological analysis, dictionary, root, diacritics, lexicon, spelling, archival, digital library.
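The root-plus-pattern idea described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the three-entry `ROOTS` list, the three diacritized `PATTERNS`, and the function names are all assumptions made for illustration, and the slot letters ف, ع, ل are assumed never to appear as literal affix letters inside a pattern. A word that matches a known pattern and root is stored as a pair of small indices; any other word is kept as a literal so that decoding stays lossless.

```python
# Toy root-and-pattern word codec (illustrative assumption, not the paper's tables).
ROOTS = ["كتب", "درس", "علم"]            # stand-in for the root dictionary
PATTERNS = ["فَاعِل", "مَفْعُول", "فَعَلَ"]     # ف, ع, ل mark the three root slots
SLOTS = "فعل"                            # slot letters, in root order


def decode_pattern(root: str, pattern: str) -> str:
    """Fill the slot letters of a pattern with the letters of a root."""
    return "".join(root[SLOTS.index(ch)] if ch in SLOTS else ch for ch in pattern)


def encode_word(word: str) -> tuple:
    """Return ("R", root_index, pattern_index) if the word decomposes,
    otherwise ("L", word) so the word can still be reproduced exactly."""
    for p_idx, pattern in enumerate(PATTERNS):
        if len(pattern) != len(word):
            continue
        letters = {}
        matched = True
        for w_ch, p_ch in zip(word, pattern):
            if p_ch in SLOTS:
                letters[p_ch] = w_ch      # candidate root letter
            elif p_ch != w_ch:
                matched = False           # affix letter or diacritic differs
                break
        if matched and len(letters) == len(SLOTS):
            root = "".join(letters[s] for s in SLOTS)
            if root in ROOTS:
                return ("R", ROOTS.index(root), p_idx)
    return ("L", word)


def decode_word(code: tuple) -> str:
    """Invert encode_word."""
    if code[0] == "R":
        return decode_pattern(ROOTS[code[1]], PATTERNS[code[2]])
    return code[1]


if __name__ == "__main__":
    word = decode_pattern("كتب", PATTERNS[1])   # a diacritized form built from the toy tables
    code = encode_word(word)
    print(code, decode_word(code) == word)
```

Because the diacritics are part of each pattern, a diacritized word costs the same two indices as an undiacritized one in this sketch. A realistic system would also need the categorized affix dictionaries and the validity checks on affix combinations that the abstract mentions.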
Similar resources
A Compression Technique for Arabic Dictionaries: The Affix Analysis
In every application that concerns the automatic processing of natural language, the problem of dictionary size arises. In this paper, we propose a dictionary compression algorithm based on an affix analysis of non-diacritical Arabic. It consists in decomposing a word into its first elements taking into account the different linguistic transformations that can affect the morphologica...
Enhancing Retrieval Effectiveness of Diacritisized Arabic Passages Using Stemmer and Thesaurus
In this paper we discuss the enhancement of Arabic passage retrieval for both diacritisized and nondiacritisized text. Most previous work suggested that retrieval start with pre-processing the Arabic text to remove the diacritical marks (short vowels) to unify the text. In most cases, this process causes considerable ambiguity at the word level in the absence of context. However, searching for ...
Machine Generation of Arabic Diacritical Marks
The absence of the vowelization marks from the modern Arabic text represents a major obstacle in machine translation and other text understanding applications. In this paper we present a formulation of the problem of automatic generation of the Arabic diacritic marks from unvoweled text using a Hidden Markov Model (HMM) approach. The model considers the word sequence of unvoweled Arabic text as...
Machine Generation of Arabic
The absence of the vowelization marks from the modern Arabic text represents a major obstacle in machine translation and other text understanding applications. In this paper we present a formulation of the problem of automatic generation of the Arabic diacritical marks from unvoweled text using a Hidden Markov Model (HMM) approach. The model considers the word sequence of unvoweled Arabic text ...
Design of Arabic Diacritical Marks
Diacritical marks play a crucial role in meeting the usability criteria of typographic text, such as homogeneity, clarity, and legibility. Changing the diacritic of a letter in a word can completely change its meaning. The situation is very complicated with multilingual text. Indeed, the problem of design becomes more difficult by the presence of diacritics that come from various scripts...